The rapid expansion of biometric applications and databases is a growing concern.
Processing extensive or sophisticated biometric data results in longer wait
times, which might restrict application usefulness. This work focuses on
accelerating the processing of biometric data and proposes a parallel method
of data processing that exceeds the capabilities of a central processing unit
(CPU). The combination of the graphics processing unit (GPU) and compute
unified device architecture (CUDA) results in at least three times the
processing speed of a published accurate and secure multimodal biometric
system. The advantage of the GPU-assisted approach over the CPU-only
implementation grows once the CPU is saturated with more people than its
available thread count. The GPU-assisted solution is also shown to have the
same accuracy as the original system, indicating accuracy and processing
performance improvements in the demanding big data environment.

1. INTRODUCTION

In any country whose population is growing exponentially, government databases of
common biometrics such as face and fingerprints grow at the same rate [1]. This means that achieving
a reasonable response time from nationwide biometric access control systems is becoming increasingly
challenging from a data processing perspective when using CPU-only solutions. Moreover, the popularity of
access control applications that rely on big data has extended beyond government applications. In fact, more
than one recent study has reported that citizens can open a bank account simply by taking a picture with a mobile
application, which verifies the user using a combination of facial biometrics and personal information [2]-[4].

For experimental purposes, real-time performance is taken to mean a response time below
three seconds. Approaching real-time performance requires utilizing a given hardware
architecture as efficiently as possible. The graphics processing unit (GPU) differs from many consumer CPUs
in that it favors processing a large number of instructions simultaneously over the performance of a single
instruction. However, this requires careful memory management and algorithm planning.

Furthermore, contemporary GPUs include hundreds of cores and can run thousands of threads
concurrently, far more than top-tier consumer CPUs, which contain between four and twelve cores. The Nvidia
GeForce GTX 1050 Ti is an example of a low-end GPU with 768 cores and the same CUDA compute capability as
the 3584-core GTX 1080 Ti. Even though CPU and GPU core performance cannot be directly compared,
GPU-assisted methods may handle the computational challenges in a growing variety of big data biometric
applications. Owing to its efficient use of low-level memory operations and instruction sets, the compute
unified device architecture (CUDA), native to Nvidia GPUs, has significantly boosted the processing
performance of diverse computer vision systems. However, GPU implementation research is often restricted to
single biometrics or multimodal biometric systems that combine modalities with minimal acquisition synergy.

In Brown and Bradshaw [5], an accurate face and iris multimodal biometric system addressed several
significant information security challenges on mobile and other digital platforms. Because these two biometrics
may be captured concurrently at a distance using a near-infrared sensor [6], the user experience is comparable
to that of a single biometric, but with enhanced security and accuracy. This acquisition approach may minimize
the time required for network data transport and memory transfer, a significant bottleneck in GPU-assisted
biometric systems and other systems using GPU processing.

This research examines the processing performance of the face and iris biometric modalities
and their fusion by reimplementing the system in [5] on the CUDA platform. More modalities, such as the hand and
fingerprint, may be handled in this manner. However, because of space constraints, they are not discussed in
this study. The reimplementation includes the optimization of preprocessing and feature detection as well as
the classification process, such that the maximum number of individuals is identified concurrently in a given
period of time. The computational performance of the CPU-only and GPU-assisted implementations of the
proposed multimodal biometric system is compared. The goal is to improve the speed at which biometric
data is processed. The accuracy of the GPU-assisted implementation is also checked to ensure
that the comparison with the original CPU version is fair.

The rest of the paper is organized as follows: Section 2 presents related face, iris, and fused systems, as well as
biometric systems that operate on the CUDA framework. Section 3 explains how the CUDA framework is
utilized to improve processing speed. Section 4 discusses the construction of the system. The experimental
analyses and results are discussed in Section 5. Section 6 concludes the paper and outlines future work.


2. RELATED RESEARCH 

Many previous studies have dealt with the iris and the face and their common characteristics.
These studies have investigated which information may be particularly important for the processing of face
identity. This section reviews results from recent studies.


2.1. Characteristics of face and iris 

Face segmentation eliminates dynamic elements such as an individual's hair and the background. The
difficulty of consistently removing these dynamic elements during segmentation has been extensively studied.
Karras et al. [7] proposed a new way to segment a face using gradient boosting to learn an ensemble
of regression trees. The system is accurate and significantly exceeds real-time performance at less than 1
millisecond per image on a single CPU. The facial landmarks are automatically generated from a sparse subset
of image intensities. In addition, the system handles incomplete or partially labeled data. The resultant facial
landmark coordinates are utilized to segment the face.

A method by Hezil and Boukrouche [8] successfully performs facial recognition with a single
training sample, without 3D modeling or deep learning, which may exacerbate the massive data processing
challenge. Their method attempts to frontalize all segmented faces using a basic mesh for each person without face
detection preprocessing. The drawback of their approach is that it takes more time because it uses 40 Gabor
filters (8 orientations at 5 scales).

On the other hand, the iris is known as the most accurate external biometric, but this accuracy can only be
achieved using sensors close to the eye [9]. Improving iris identification with sensors at a distance remains a
challenge. There are two primary ways of segmenting the iris: the integrodifferential operator and the Hough
transform (HT) [10]. The integrodifferential operator performs an exhaustive search for the iris's center and
radius by finding the maximum of the blurred derivative, with respect to increasing radius, of normalized
contour integrals along circular trajectories. In contrast, HT uses binary edge maps to identify pupil and iris
borders: votes are tallied to estimate the dimensions of the concentric border circles.
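
The vote tallying performed by the HT can be illustrated with a minimal pure-Python sketch. It assumes a known radius and a list of edge points, which is a simplification: the systems cited here also search over radii. The function names `circle_points` and `hough_circle_center` are hypothetical and introduced only for this illustration.

```python
import math
from collections import Counter


def circle_points(cx, cy, radius, step_deg=10):
    """Sample points on a circle; a stand-in for a binary edge map."""
    return [
        (cx + radius * math.cos(math.radians(d)),
         cy + radius * math.sin(math.radians(d)))
        for d in range(0, 360, step_deg)
    ]


def hough_circle_center(edge_points, radius):
    """Return the accumulator cell with the most votes for a fixed radius."""
    votes = Counter()
    for x, y in edge_points:
        # Every centre lying `radius` away from this edge point is a candidate.
        candidates = set()
        for d in range(360):
            theta = math.radians(d)
            a = round(x - radius * math.cos(theta))
            b = round(y - radius * math.sin(theta))
            candidates.add((a, b))
        # One vote per cell per edge point, so a point cannot vote twice.
        votes.update(candidates)
    return votes.most_common(1)[0][0]


edges = circle_points(10, 10, 5)
print(hough_circle_center(edges, 5))
```

Each edge point votes independently of the others, which is one reason Hough-based methods are natural candidates for parallel execution.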

Chaki et al. [11] used an HT algorithm suited to iris extraction. The restricted circular Hough
transform (RCHT) is a method that looks for circles bounded by the upper and lower eyelids. The
texture extracted between the two concentric circles constitutes the iris's feature vector. The larger vector,
composed of the feature vectors for each eye, is classified using support vector machines (SVMs).


2.2. Fusion of face and iris

Ammour et al. [12] devised a thorough face and iris fusion approach at the feature level.
Throughout the preprocessing stage, they stress the significance of feature alignment. Extensive experiments
show that alignment significantly improves recognition accuracy when using biometric data. The two eye
sockets serve as a dividing line to split the face in half and provide a vertical axis. Iris alignment is done by
cutting the iris and rotating each half into the position that matches the pose of the face.

The accuracy was verified using the CASIA-Iris-Distance dataset, which comprises close-up
near-infrared face images. These images need a memory transfer to the GPU
device before they can be decomposed into face and iris representations. The analysis focused on feature
alignment and limiting false matches, as determined by a false acceptance rate (FAR) of 0.01%. The alignment of
both irises improved verification accuracy by 44%. Aligning the faces improved verification accuracy by 7%.
Furthermore, combining the face and both irises increased accuracy by 14%. The genuine acceptance rate (GAR)
was 87.33% for the face, 71.55% for the fused irises, and 94.44% for the fusion of the face and both irises.
For more details, see [13], [14].


2.3. Iris recognition using the GPU

Using the CASIA iris dataset, Jamaludin et al. [15] demonstrated the effectiveness of GPU parallel
processing. Noise in the iris segmentation was reduced by not using the whole iris and excluding the upper
and lower eyelid areas. This yielded a GAR of around 96%. The iris recognition technique takes 719 ms to
complete, which is 200 ms faster than the existing gold standard. All image processing methods were executed
on the GPU, which was twice as fast as the CPU.


3. GPU-AIDED COMPUTING UTILIZING CUDA 

CUDA offers rapid parallel programming and enables general-purpose computing on GPUs (GPGPU).
This is made feasible by a cluster of multiprocessors with hardware-level task scheduling, as in the single
instruction, multiple threads (SIMT) architecture. It is ideal for running the same program instruction on a large
population's biometric data all at once. The multiprocessors communicate with each other through the shared
device memory. All processor cores can access it, but it depends on the CPU and main memory to operate [16], [17].

The CUDA framework's software component extends the C/C++ programming language [18], [19].
From the perspective of a CUDA programmer, planning a GPU-assisted biometric system requires
both host code and device code, running on the CPU and GPU, respectively. This research uses host
code for algorithms like Gaussian blur that exhibit little data parallelism or run more slowly on the device due
to low complexity. On the other hand, algorithms with a lot of data parallelism are loaded into the device's
memory and executed in device code.

The left side (host) of Figure 1 illustrates the host's organization of device code into kernels for
low-level hardware communication [20]. A kernel call made on the host runs a piece of code
on the device that can be called again, treated as a standalone function, and used with new data each time. The
result is a biometric system that runs in numerous concurrent threads in the device's memory,
despite its sequential origins. A programmer must still explicitly declare the device memory allocations. However,
the computer vision algorithms are often configured to specify memory allocations automatically, except in
the classification phase and during the processing of many users simultaneously.


Figure 1. The CUDA framework (panel labels: CPU computer, GPU computer, thread execution control unit)

Hardware threads reside in the multiprocessors and run in a coordinated fashion as a set of threads
inside a warp. Warps are hardware-managed groups of 32 threads. However, additional threads in device
memory may be necessary for machine vision and other GPGPU applications. This is particularly true when
analyzing biometric data from different people in parallel. Each block, as shown at the bottom of Figure 1,
groups threads and may accommodate up to 1,024 threads. A hardware scheduler coordinates the
operation of the group of warps inside a single block. It can be described as a zero-latency scheduler because it
can hide memory bottlenecks in the pipeline and communication delays between devices by switching
warps at the right moments [21], [22].
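
The relationship between work items, blocks, and warps can be made concrete with a short sketch of the launch arithmetic. It is written in plain Python for illustration only; the function name `launch_config` and the default of 256 threads per block are assumptions, while the warp size of 32 and the limit of 1,024 threads per block come from the description above.

```python
WARP_SIZE = 32                 # threads per warp, as stated in the text
MAX_THREADS_PER_BLOCK = 1024   # per-block limit, as stated in the text


def launch_config(n_items, threads_per_block=256):
    """Return (blocks, warps_per_block, idle_threads) for one thread per item."""
    if n_items < 1:
        raise ValueError("n_items must be positive")
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 1024")
    # Ceiling division: enough blocks to cover every work item.
    blocks = -(-n_items // threads_per_block)
    # A partially filled warp still occupies a whole warp in hardware.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    # Threads launched beyond n_items must be masked out by a bounds check.
    idle_threads = blocks * threads_per_block - n_items
    return blocks, warps_per_block, idle_threads


print(launch_config(1000))
```

The idle threads in the last block are the reason kernels normally guard their body with a bounds check on the global thread index.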

Memory transfer overhead is introduced when the computed result is returned to the host's
memory. This becomes most apparent when the transfer happens inside code loops, which is therefore not a
"recommended practice" [22]. The relative processing performance of the CPU-only and GPU-assisted
solutions is expected to depend significantly on the number of people processed. In this research, we
reduce the memory transfer burden on the GPU-assisted implementation in the following ways:

a) Since both the face and the iris can be easily acquired as a single image and placed into device memory for
training and testing (as explained in the following two items), their complementary nature can be exploited.

b) Before GPU-assisted computing begins, the data of a given number of people is loaded into the device's
memory so that it can be used to train the classifier.

c) Before classification testing, the data model is loaded into the device's memory. GPU-assisted
computing is then used to predict the class of a sample set of people.

d) The device performs as much computation as possible before sending the results back to the host.

e) Since the preprocessing procedures are the first to be performed, the host machine may be used to carry them out.


4. THE PROPOSED SYSTEM

Testing was performed on a computer equipped with an Intel 6400 quad-core CPU and an Nvidia
GTX 1050 Ti GPU. The C/C++ code for the CPU-only version uses OpenCV and Dlib for image processing
and Liblinear for machine learning. The CUDA libraries are used in the GPU-accelerated version. CPU and GPU
timings were measured with the C++ chrono high-resolution clock. This approach demonstrates a
multimodal biometric system that is fast and accurate. Figure 2 outlines the method, separating
the face from the iris to show how the two may be processed in parallel; the figure is referred to throughout
this section to illustrate the different stages of the system.


Figure 2. The proposed face and iris recognition solution


4.1. Steps before detecting features 

As part of the preprocessing phase, a Gaussian blur with a 5 × 5 kernel is applied to the whole
image, including the face and iris. Given that the face and iris algorithms are separate, they may be run in
parallel, as shown in Figure 2. During preprocessing, a single image undergoes several other elementary
operations, such as scaling and pixel normalization, which run more quickly on the CPU than on the GPU. The
initial objective is to locate the eyes within the facial regions sent to the GPU. Thus, the face and iris may be
processed in parallel after being loaded only once. Haar object detection [23] is used to isolate the eyes from the
rest of the face, and it runs twice as quickly on the GPU as it does on the CPU. Additionally, the transfer
between host and device memory takes around 60 µs per image.
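
As an illustration of the blur step, the following pure-Python sketch builds a normalized 5 × 5 Gaussian kernel and applies it to a small grayscale image. The value of sigma and the border handling (replicating edge pixels) are assumptions, since neither is specified here, and the function names are hypothetical. The system described in this paper relies on library routines rather than code of this kind.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian kernel (sigma is an assumption)."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = size // 2
    weights = [
        [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
         for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def blur(image, kernel):
    """Convolve a 2D list of pixel values with the kernel, replicating borders."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for kr, kernel_row in enumerate(kernel):
                for kc, weight in enumerate(kernel_row):
                    # Clamp coordinates so border pixels reuse the nearest pixel.
                    rr = min(max(r + kr - half, 0), height - 1)
                    cc = min(max(c + kc - half, 0), width - 1)
                    acc += weight * image[rr][cc]
            out[r][c] = acc
    return out


kernel = gaussian_kernel()
smoothed = blur([[0, 0, 0], [0, 255, 0], [0, 0, 0]], kernel)
```

Each output pixel needs only 25 multiply-adds, which is consistent with the observation that such low-complexity operations may not repay the cost of a transfer to the device.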

Detecting features of face and iris: The first step is to settle on a target region of interest (RoI).
Facial landmark detection uses a linear support vector machine and a sliding window classifier based on
histogram of oriented gradients (HoG) features. Face detection is performed by rotating back and forth along
the x-axis in 5-degree steps in both directions until a match is found.

Figure 2 shows the results of the model-based detection of 68 interpolated landmarks inside the
original face RoI. The landmark model employed in this work was trained on the iBUG 300-W annotated face
dataset in the form of a cascade of regression trees [24]. The time complexity of training and
testing the landmark detector is decreased by computing the transform and warping just once at each level of
the cascade. Furthermore, training is performed only once, enabling the landmark detector to handle unseen
faces from distinct datasets. Using the GPU to compute the HoG features sped up the process
by an order of magnitude. However, because the complete face detector runs in less than a millisecond per
person, the memory copy to and from the GPU device decreases this speedup from 800% to 50%.
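
The effect of the memory copy on the speedup can be sketched with simple arithmetic. The timings below are hypothetical values chosen only so that the result reproduces the 800% and 50% figures; they are not measurements from this work. Reading the percentages as gains relative to the CPU time is also an assumption.

```python
def effective_speedup(t_cpu_ms, t_gpu_ms, t_copy_ms=0.0):
    """CPU time divided by GPU compute time plus host/device copy time."""
    return t_cpu_ms / (t_gpu_ms + t_copy_ms)


def gain_percent(speedup):
    """Express a speedup factor as a percentage gain (2x -> 100%)."""
    return (speedup - 1.0) * 100.0


# Hypothetical timings in milliseconds, for illustration only.
T_CPU, T_GPU, T_COPY = 0.9, 0.1, 0.5

print(round(gain_percent(effective_speedup(T_CPU, T_GPU)), 1))
print(round(gain_percent(effective_speedup(T_CPU, T_GPU, T_COPY)), 1))
```

When the kernel itself takes a fraction of a millisecond, a copy of similar magnitude dominates the denominator, which is why batching many people per transfer matters more than further kernel tuning.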

When capturing the RoI around the pupil in a visual image, the RCHT technique is applied to the input
data [9]. A Laplacian of Gaussian (LoG) filter with 80 and 1900 kernels improves the RCHT method's
consistency. The iris border is then identified by comparing the sclera with the iris area below or horizontal
to the pupil, depending on the boundary threshold. This threshold accommodates subjects who look to the side
or squint. The concentric circles at the boundaries represent the result. The GPU-assisted version of RCHT
resulted in a 50-fold performance gain.


4.2. Matching features 

We use the confidence score generated by the fast local binary pattern histogram (LBPH)
approach, in conjunction with a sliding window, as a failsafe for feature alignment. In high-security settings,
low-quality data can be rejected more stringently by adjusting the threshold on this score. Before resorting to
this fallback strategy, the following feature alignment techniques are applied to the face and the iris,
respectively. Because of their already fast computation time on the CPU, none of these techniques is run on the GPU.
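
The LBPH approach builds on the basic local binary pattern operator, which can be sketched in a few lines. The example below computes the 8-bit code of a pixel from its 3 × 3 neighbourhood and accumulates a 256-bin histogram. The bit ordering (clockwise from the top-left neighbour) is an assumption, as conventions differ between implementations, and the computation of the confidence score itself is not shown.

```python
# Offsets of the 8 neighbours, clockwise from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """8-bit LBP code: each neighbour contributes 1 if it is >= the centre."""
    center = image[r][c]
    code = 0
    for dr, dc in NEIGHBOURS:
        bit = 1 if image[r + dr][c + dc] >= center else 0
        code = (code << 1) | bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
print(lbp_code(patch, 1, 1))
```

Two regions can then be compared through a distance between their histograms, with a smaller distance indicating a better match.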

Following the identification of critical anatomical landmarks, a facial mesh is generated. The face
mesh is built using Delaunay triangulation so that no landmark lies within the circumcircle of any triangle [25].
The leftward-posing mesh is brought into a frontal orientation, as seen in Figure 2. Faces at angles of up to 30
degrees may be quickly frontalized using this approach.
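
The circumcircle condition that defines a Delaunay triangulation can be checked with a standard determinant test. The sketch below is illustrative only and uses a hypothetical function name; a full triangulation algorithm, as provided by image processing libraries, applies this predicate repeatedly while building the mesh.

```python
def in_circumcircle(a, b, c, p):
    """Return True if point p lies strictly inside the circumcircle of triangle abc."""
    adx, ady = a[0] - p[0], a[1] - p[1]
    bdx, bdy = b[0] - p[0], b[1] - p[1]
    cdx, cdy = c[0] - p[0], c[1] - p[1]
    asq = adx * adx + ady * ady
    bsq = bdx * bdx + bdy * bdy
    csq = cdx * cdx + cdy * cdy
    det = (adx * (bdy * csq - bsq * cdy)
           - ady * (bdx * csq - bsq * cdx)
           + asq * (bdx * cdy - bdy * cdx))
    # The sign of the determinant depends on the triangle's orientation,
    # so it is flipped for clockwise triangles.
    orientation = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    if orientation < 0:
        det = -det
    return det > 0


triangle = ((0, 0), (1, 0), (0, 1))
print(in_circumcircle(*triangle, (0.5, 0.5)))
print(in_circumcircle(*triangle, (2, 2)))
```

A triangulation is Delaunay when this test returns False for every triangle and every landmark that is not one of that triangle's vertices.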

An inverted Gaussian filter with an 11 × 11 kernel is used to enhance the sharpness of the iris texture
inside the concentric circles that make up the RoI. To again accommodate squinting and off-center
eye gaze, the top and bottom iris areas are trimmed near the margin of the pupil. However, before the data is
trimmed, the iris texture is aligned by rotating the query image to match the database image.
This allows for the alignment of textures that are not obscured.


4.3. Extracting features 

The preceding procedures eliminated intra-class variance, which is necessary for adequate 
classification. However, it is also essential to maximize inter-class variance because of the role it plays in 
minimizing false matches. In the GPU-assisted version of the system, the feature extraction process is not run 
on the GPU. 

The techniques in [25]-[27] have been shown to work well together to maximize inter-class variance for
image-based biometrics. The LoG filter eliminates undesirable features in the low- and high-frequency
spectrum before boosting the remaining features, thus improving the mean signal. The Gaussian and Laplacian
kernels were 15 × 15 and 7 × 7, respectively. This filter was applied to all of the normalized and segmented data.
Using the local line binary pattern (LLBP) operator, brightness variations can be attenuated without increasing
background noise. The final images are downsized to 75 × 75 pixels so they may be used in classification.


4.4. Classification 

The k-NN classifier uses the eigenfaces approach [28], which is the most effective method for
image-based biometrics. This approach uses a linear combination of features to maximize the total
variance in the data. The first five principal components, which are used to create the model, capture the
majority of the variation in the data. This is a common technique that effectively decreases background noise in
data. During the prediction phase on the GPU, many threads run at the same time to handle a large number of
samples. The total scatter matrix for a set of N sample images $a_k$ is defined as:

$$X_T = \sum_{k=1}^{N} (a_k - m)(a_k - m)^{T}$$

where $m \in \mathbb{R}^{N}$ is the mean image obtained from the samples.
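
A minimal sketch of the total scatter matrix computation is given below for small vectors. It assumes that each sample image has been flattened into a list of numbers; the function names are hypothetical, and a practical implementation would use an optimized linear algebra library rather than nested loops.

```python
def mean_vector(samples):
    """Element-wise mean of equally sized sample vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def total_scatter(samples):
    """Sum of outer products of each sample's deviation from the mean."""
    m = mean_vector(samples)
    dim = len(m)
    scatter = [[0.0] * dim for _ in range(dim)]
    for a in samples:
        d = [x - mi for x, mi in zip(a, m)]
        for i in range(dim):
            for j in range(dim):
                scatter[i][j] += d[i] * d[j]
    return scatter


# Two flattened "images" with two pixels each, for illustration only.
print(total_scatter([[1.0, 2.0], [3.0, 6.0]]))
```

The principal components used by the eigenfaces approach are the eigenvectors of this matrix with the largest eigenvalues.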

5. EXPERIMENTAL EVALUATION AND RESULTS

In this section, we analyze and compare the results of two experiments designed to
evaluate accuracy and processing speed on an at-a-distance dataset. As described in Section 2, related
state-of-the-art research uses challenging datasets. GAR is always used when assessing a result. For
verification purposes, the false rejection rate equals 100% minus GAR.


5.1. Datasets from experiments 

The following dataset was used for empirical testing: the challenging CASIA-Iris-Distance dataset, which is
also employed in the related work discussed earlier. The data was collected from a distance of three
meters using a sensor with a resolution of 2352 × 1728 designed and built by the dataset's authors. The strong
iris results on this dataset show that the iris can be a powerful biometric that is easy for the user to use.

The following subset of the CASIA-Iris-Distance dataset is used: 90 people are selected for comparison with
Ammour et al. [12]. This data serves as a trial run for near-infrared camera sensors built into
smartphones and other digital devices that can outsource processing to a central server. For each of the 90
people, five samples are used for training and five for testing.


5.2. Accuracy experiment

Verification establishes authenticity by using k-nearest neighbor (k-NN) to determine whether a
training sample from a known class and a test sample have a distance in eigenspace below a threshold. Figure 3
shows how the proposed face and iris systems fared compared to Ammour et al. [12] using the
CASIA-Iris-Distance dataset for verification purposes. Verification accuracy is reported at 0.01% FAR. The
proposed approach improved the results for the face, the two irises, and the combination of the three. The
improvement for the irises is significantly more pronounced than that for the face. This is attributed to the
iris algorithm being designed for long-range operation.
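
The relationship between the distance threshold and the two rates can be sketched as follows. The distances in the example are made-up values for illustration, and the function name `far_gar` is hypothetical. A sample pair is accepted when its distance is below the threshold.

```python
def far_gar(genuine_distances, impostor_distances, threshold):
    """Return (FAR, GAR) in percent for a given distance threshold."""
    accepted_impostors = sum(d < threshold for d in impostor_distances)
    accepted_genuine = sum(d < threshold for d in genuine_distances)
    far = 100.0 * accepted_impostors / len(impostor_distances)
    gar = 100.0 * accepted_genuine / len(genuine_distances)
    return far, gar


# Made-up distances: genuine pairs tend to be closer than impostor pairs.
genuine = [0.1, 0.2, 0.3, 0.9]
impostor = [0.5, 0.8, 1.2, 1.5]

print(far_gar(genuine, impostor, 0.4))
print(far_gar(genuine, impostor, 1.0))
```

Lowering the threshold reduces FAR at the cost of GAR, so reporting GAR at a fixed FAR, as done here, makes systems directly comparable.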


5.3. Processing performance experiment

A second verification round is performed, this time on the GPU-assisted implementation. The
results match those presented in Figure 3, meaning that the system was successfully ported to the CUDA
environment. Figure 4 compares the processing speeds of the CPU-only and GPU-assisted variants of the same
methods, as well as the speed-up. The largest speed-up, around 50 times, is achieved with the HoG filter face
detection approach. This is followed by the RCHT, a Hough-based algorithm, and Haar. The 588 ms execution time
was faster than the GPU iris system of Jamaludin et al. [15].

Due to time restrictions, the training portion of classification was not moved to the GPU. Prediction
speed for a single person is comparable between the CPU and GPU versions. Figure 5 shows that the transfer time
between the device and host is about 150 milliseconds, multiplied by the number of subjects tested (1, 4, 6, 10, or
18). An exponential trend is observed for the prediction phase after exhausting the (four) threads on the CPU.
This demonstrates that the processing speed of biometric data may be determined, in part, by the number of
physical cores available in a computer rather than by software-level threads in a parallel processing architecture.
The GPU-assisted version achieves real-time performance while concurrently classifying (including all
accumulated timings) up to 18 people. This is a promising result, particularly given the use of a low-end GPU.

Figure 3. Results of validation tests conducted on the CASIA-Iris-Distance dataset

Figure 4. Comparison of the processing speeds of the CPU-only and GPU-assisted algorithms (in milliseconds),
with the speed-up

Figure 5. Comparison of classifier prediction speed for the CPU-only and GPU-assisted versions (in milliseconds)


6. CONCLUSIONS AND FUTURE WORK 

Using many biometric modalities increases security but may slow down processing for large data
sets. This article demonstrates a dramatic increase in the identification accuracy for both genuine and impostor
users, making the system suitable for use in high-security settings. The results show that at-a-distance iris
recognition is more accurate than the more conventional facial identification method. The proposed method
significantly improves recognition accuracy compared with the current gold standard iris verification system.
Combining facial and iris features led to near-perfect accuracy and additional protection against counterfeiting.
Face and iris traits may be quickly acquired together and complement each other in biometric systems. They also
make it easier to transfer data, which is essential for getting information to the GPU as quickly as possible.

To take advantage of the GPU's superior single-instruction, multiple-thread throughput, a GPU-accelerated
variant was developed by reimplementing the original system while avoiding running basic operations on the GPU.
Overall performance improves once it is recognized that not all algorithms need to be run on the GPU. In
addition, the HoG algorithm with GPU support improved processing performance by 50 times, the most of any single
method. Overall, the GPU-assisted system was three times faster when only one subject was processed,
and the benefit multiplies as more people are evaluated at once.

The GPU's high thread count enabled a 24-fold speed-up when assessing 18 people. Moreover, it
was determined that the GPU-assisted version was just as accurate as the original system, which illustrates
the effective reimplementation of a reliable biometric system on a GPU. In future work, we intend to investigate
further improvements for the prediction phase and migrate the training phase to the CUDA framework.
Other classifiers will be added and adapted to the CUDA framework as well. Finally, an iris detection
technique that is more effective in parallel, such as a HoG filter with regression trees, may be used in place
of Haar cascades.